1 The Gapminder data

Gapminder is an independent Swedish foundation with no political, religious or economic affiliations, providing many datasets on the state of the world around us. We will be wrangling with the Gapminder dataset for this module, provided in the R package gapminder.

The gapminder dataset provides data for 142 countries, with values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.

By the end of this module we will be able to produce a graph exploring income per person over the years for four different countries (US, Iran, China, and Nigeria).

library(tidyverse)
library(lubridate)
library(naniar)
library(gapminder)

gapminder
## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows
gapminder %>% names()
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

1.1 Tibble

You can think of a tibble as an enhanced data frame. One of the main differences between data frames and tibbles is the way they print data. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. Data frames on the other hand will print all rows, which isn’t very useful if your data frame is very big!

We see here that we have a tibble of size 1704 observations relating to 6 features. The data is in tidy format, with variables corresponding to columns, and observations corresponding to rows.

1.2 R Packages

Throughout this module, we will explore a variety of R packages to help us wrangle our data and make the most sense out of it. These are just a selection out of many R packages that exist, and you might not need to use all of them in your analysis for your own data. These packages help us do the following:

  • readr and readxl to help us read in data into R;
  • magrittr to help us write readable R code;
  • dplyr to help us wrangle and obtain essential summaries of our data;
  • tidyr to help us rearrange our data into usable formats;
  • ggplot2 to help us visualise our data;
  • lubridate to help us wrangle dates in R;
  • and naniar, to help us look at missing values in R.

2 Data Wrangling with dplyr

2.1 Subsetting dataframes

  • filter() filter rows that satisfy certain conditions
  • select() select columns by variable names
gapminder %>% filter(year == "1997", continent == "Asia", lifeExp > 75)
## # A tibble: 6 x 6
##   country          continent  year lifeExp       pop gdpPercap
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Hong Kong, China Asia       1997    80     6495918    28378.
## 2 Israel           Asia       1997    78.3   5531387    20897.
## 3 Japan            Asia       1997    80.7 125956499    28817.
## 4 Kuwait           Asia       1997    76.2   1765345    40301.
## 5 Singapore        Asia       1997    77.2   3802309    33519.
## 6 Taiwan           Asia       1997    75.2  21628605    20207.

Looking at rows corresponding to the year 1997 in Asia with life expectancy greater than 75.

gapminder %>% filter(year == "1997", continent == "Asia", lifeExp > 75) %>% select(country, lifeExp) 
## # A tibble: 6 x 2
##   country          lifeExp
##   <fct>              <dbl>
## 1 Hong Kong, China    80  
## 2 Israel              78.3
## 3 Japan               80.7
## 4 Kuwait              76.2
## 5 Singapore           77.2
## 6 Taiwan              75.2

Looking at only the country and life expectancy variable that matches the previous condition.

gapminder %>% filter(year == "1997", continent == "Asia", lifeExp > 75) %>% select(-lifeExp, -continent, -year) 
## # A tibble: 6 x 3
##   country                pop gdpPercap
##   <fct>                <int>     <dbl>
## 1 Hong Kong, China   6495918    28378.
## 2 Israel             5531387    20897.
## 3 Japan            125956499    28817.
## 4 Kuwait             1765345    40301.
## 5 Singapore          3802309    33519.
## 6 Taiwan            21628605    20207.

Using - to deselect variables.

2.2 Mutating dataframes

  • mutate() create new columns by using existing information from other columns
  • arrange() arrange rows within data frame
gapminder <- gapminder %>% mutate(gdp = pop*gdpPercap)

Notice here we assigned this mutated data back to the same variable gapminder. gapminder now has a new variable gdp added, that is the product of the two existing variables pop and gdpPercap.

Showing rows by lowest to highest gdp in 2007 excluding continent, pop and gdpPercap variables.

gapminder %>% filter(year == "2007") %>% arrange(gdp) %>% head() %>% select(-pop, -gdpPercap, -continent, -year)
## # A tibble: 6 x 3
##   country               lifeExp         gdp
##   <fct>                   <dbl>       <dbl>
## 1 Sao Tome and Principe    65.5  319014077.
## 2 Comoros                  65.2  701111696.
## 3 Guinea-Bissau            46.4  852652874.
## 4 Djibouti                 54.8 1033689705.
## 5 Gambia                   59.4 1270911775.
## 6 Liberia                  45.7 1323912407.
head(gapminder)
## # A tibble: 6 x 7
##   country     continent  year lifeExp      pop gdpPercap          gdp
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
## 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
## 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
## 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
## 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
## 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.

Same data but in descending order

gapminder %>% filter(year == "2007") %>% arrange(desc(gdp)) %>% head() %>% select(-pop, -gdpPercap, -continent, -year)
## # A tibble: 6 x 3
##   country        lifeExp     gdp
##   <fct>            <dbl>   <dbl>
## 1 United States     78.2 1.29e13
## 2 China             73.0 6.54e12
## 3 Japan             82.6 4.04e12
## 4 India             64.7 2.72e12
## 5 Germany           79.4 2.65e12
## 6 United Kingdom    79.4 2.02e12

2.3 Grouping and summarising dataframes

  • group_by() adds extra structure to your dataset, and can be useful if you want to apply a function independently within groups of observations (where the groups are specified by a categorical variable in your data frame). All functions that you apply to the grouped data frame are applied separately to each group until you specify an ungroup() function.
  • summarise() defines variables that are summary statistics of others.
gapminder %>% group_by(continent)
## # A tibble: 1,704 x 7
## # Groups:   continent [5]
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # … with 1,694 more rows

The key difference is now the addition of Groups: continent [5] to the output. In the background, the group_by() function has essentially worked by splitting your data frame into 5 separate data frames based on the categorical variable you specify, in this case, continents.

tally() counts rows, combined with group_by() we can count how many rows there are for each continent group:

gapminder %>% group_by(continent) %>% tally()
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24

This will show how many rows have lifeExp higher than the average lifeExp for their continent group:

gapminder %>% group_by(continent) %>% filter(lifeExp > mean(lifeExp))  %>% tally()
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      282
## 2 Americas    176
## 3 Asia        216
## 4 Europe      189
## 5 Oceania      10

If we had just filtered rows without group_by(), it will calculate the average lifeExp for the entire gapminder dataset instead of respective continent groups:

gapminder %>% filter(lifeExp > mean(lifeExp)) %>% count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa       80
## 2 Americas    218
## 3 Asia        224
## 4 Europe      349
## 5 Oceania      24

This shows the average lifeExp and total gdp over years.

gapminder %>% group_by(year)  %>% summarise(mean_lifeExp = mean(lifeExp), total_gdp = sum(gdpPercap*pop))
## # A tibble: 12 x 3
##     year mean_lifeExp total_gdp
##    <int>        <dbl>     <dbl>
##  1  1952         49.1   7.04e12
##  2  1957         51.5   8.90e12
##  3  1962         53.6   1.10e13
##  4  1967         55.7   1.42e13
##  5  1972         57.6   1.84e13
##  6  1977         59.6   2.23e13
##  7  1982         61.5   2.54e13
##  8  1987         63.2   3.01e13
##  9  1992         64.2   3.45e13
## 10  1997         65.0   4.10e13
## 11  2002         65.7   4.73e13
## 12  2007         67.0   5.81e13

summarise_at() provide summaries of many variables at the same time. Here we want summaries of the minimum and maximum life expectancy and total gdp.

gapminder %>% 
  filter(year %in% c(1977, 2002)) %>%
  group_by(continent, year) %>%
  summarise_at(vars(lifeExp, gdpPercap), list(~min(.), ~max(.)))
## # A tibble: 10 x 6
## # Groups:   continent [5]
##    continent  year lifeExp_min gdpPercap_min lifeExp_max gdpPercap_max
##    <fct>     <int>       <dbl>         <dbl>       <dbl>         <dbl>
##  1 Africa     1977        36.8          502.        67.1        21951.
##  2 Africa     2002        39.2          241.        75.7        12522.
##  3 Americas   1977        49.9         1874.        74.2        24073.
##  4 Americas   2002        58.1         1270.        79.8        39097.
##  5 Asia       1977        31.2          371         75.4        59265.
##  6 Asia       2002        42.1          611         82          36023.
##  7 Europe     1977        59.5         3528.        76.1        26982.
##  8 Europe     2002        70.8         4604.        80.6        44684.
##  9 Oceania    1977        72.2        16234.        73.5        18334.
## 10 Oceania    2002        79.1        23190.        80.4        30688.

3 Reshaping and Basic Visualisation with tidyr and ggplot2

3.1 Reshaping dataframes

  • pivot_longer() tidies datasets and makes them longer by rearranging variables in the columns and observation in rows.

Graph of gdpPercap over years for US, China, Nigeria and Iran

gapminder %>% 
  filter(country == "United States" | country == "China" | country == "Nigeria"| country == "Iran") %>% 
  ggplot(aes(x = year, y = gdpPercap, colour = country))+ geom_line() + scale_y_log10() + ggtitle("gdpPercap from 1950 - 2007")

However, applying dplyr functions is not always enough to be able to get data into a form ready for visualisation.

Suppose we want to plot the following:

group_data <- gapminder %>% group_by(year) %>% summarise(mean_lifeExp = mean(lifeExp), total_gdp = sum(gdpPercap*pop))

We want to plot both the average life expectancy and total gdp as lines on the same graph. We could try and pipe the summarised data frame directly into the ggplot command, however, we have no clear way to map both average life expectancy AND total gdp to the y variable, and year on the x-axis.

group_data %>% pivot_longer(-year, names_to = "summary", values_to = "values") 
## # A tibble: 24 x 3
##     year summary       values
##    <int> <chr>          <dbl>
##  1  1952 mean_lifeExp 4.91e 1
##  2  1952 total_gdp    7.04e12
##  3  1957 mean_lifeExp 5.15e 1
##  4  1957 total_gdp    8.90e12
##  5  1962 mean_lifeExp 5.36e 1
##  6  1962 total_gdp    1.10e13
##  7  1967 mean_lifeExp 5.57e 1
##  8  1967 total_gdp    1.42e13
##  9  1972 mean_lifeExp 5.76e 1
## 10  1972 total_gdp    1.84e13
## # … with 14 more rows

We now have a much longer data frame (twice as many rows as before), with now all the values stored in a variable value and the summary of these values now in summary.

This is now perfect for ggplot, as the variable value can directly be used for the y variable, and two lines can be generated using the summary variable. We now pipe this directly into ggplot:

group_data %>% 
  pivot_longer(-year, names_to = "summary", values_to = "values") %>% 
  ggplot(aes(x= year, y = values, color= summary)) + geom_line() + facet_wrap(.~summary, scales = "free_y")

  • pivot_wider() is the opposite of pivot_longer()

First, define long_form as the long format of our data grouped by year.

long_form <- group_data %>% pivot_longer(-year, names_to = "summary", values_to = "values") 

We can then convert back to wide form by telling R that the column names belong in the variable summary, and the values are in a variable called values. And tada! Our original data frame is back!

long_form %>% 
  pivot_wider(names_from = summary, values_from = values)
## # A tibble: 12 x 3
##     year mean_lifeExp total_gdp
##    <int>        <dbl>     <dbl>
##  1  1952         49.1   7.04e12
##  2  1957         51.5   8.90e12
##  3  1962         53.6   1.10e13
##  4  1967         55.7   1.42e13
##  5  1972         57.6   1.84e13
##  6  1977         59.6   2.23e13
##  7  1982         61.5   2.54e13
##  8  1987         63.2   3.01e13
##  9  1992         64.2   3.45e13
## 10  1997         65.0   4.10e13
## 11  2002         65.7   4.73e13
## 12  2007         67.0   5.81e13

3.2 Gapminder Life Expectancy Plot

Investigate life expectancy in Europe from 1952 - 2007

gapminder %>% filter(continent == "Asia") %>% group_by(year) %>% summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp)) %>% pivot_longer(-year, names_to = "summary", values_to = "values") %>% ggplot(aes(x = year, y = values, color = summary)) + geom_line(size = 1.25) + ggtitle("Minimum and Maximum Life Expectancies in Asia, 1950 - 2007")

4 Pima-Indians Diabetes Dataset

4.1 Missing Values

select_if(is.numeric) selects columns that are of the numeric class.

library(purrr)
library(tidyr)
library(mlbench)

data(PimaIndiansDiabetes)
head(PimaIndiansDiabetes)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74       0       0 25.6    0.201  30      neg
PimaIndiansDiabetes %>%
  select_if(is.numeric) %>% 
  pivot_longer(everything(),names_to = "variable", values_to = "value") %>% 
  ggplot(aes(value)) +
  facet_wrap(~ variable, scales = "free") +
  geom_histogram() +
  theme_bw(base_size = 18)

The histograms above reveal implicit missing values in glucose, pressure, triceps, insulin, and mass (since they can’t be zero!). So we convert these explicitly to missing values.

tib <- as_tibble(PimaIndiansDiabetes)
is.zero <- function(x) { return(x==0); }

# Replace implicit missing values with NAs
dat <- tib %>%
  mutate_at(which(colnames(tib)%in%c("glucose","pressure","triceps","insulin","mass")),
            funs(replace(., is.zero(.), NA)))

# miss_var_summary(dat) summarise the amount of missing values in each variables.
miss_var_summary(dat)
## # A tibble: 9 x 3
##   variable n_miss pct_miss
##   <chr>     <int>    <dbl>
## 1 insulin     374   48.7  
## 2 triceps     227   29.6  
## 3 pressure     35    4.56 
## 4 mass         11    1.43 
## 5 glucose       5    0.651
## 6 pregnant      0    0    
## 7 pedigree      0    0    
## 8 age           0    0    
## 9 diabetes      0    0
# gg_miss_var(dat) shows the relative amounts of missingness.
gg_miss_var(dat)

# vis_miss(dat) shows snapshot of overall amount of missingness in the data
vis_miss(dat) 

4.2 Outliers and identifying skewness

We can check for skewness by first looking at numerical summaries of the median and mean for the variables in our data.

dat %>%
  select_if(is.numeric) %>% 
  pivot_longer(everything(),names_to = "variable", values_to = "value") %>% 
  group_by(variable) %>% 
  summarise(mean = mean(value, na.rm = TRUE), median = median(value, na.rm = TRUE))
## # A tibble: 8 x 3
##   variable    mean  median
##   <chr>      <dbl>   <dbl>
## 1 age       33.2    29    
## 2 glucose  122.    117    
## 3 insulin  156.    125    
## 4 mass      32.5    32.3  
## 5 pedigree   0.472   0.372
## 6 pregnant   3.85    3    
## 7 pressure  72.4    72    
## 8 triceps   29.2    29

We can also inspect the boxplots of these variables to visualise skewness, and to directly identify outliers.

dat %>%
  select_if(is.numeric) %>% 
  pivot_longer(everything(),names_to = "variable", values_to = "value") %>% 
  ggplot(aes(value, x = variable)) +
  facet_wrap(~ variable, scales = "free") +
  geom_boxplot() +
  theme_bw(base_size = 18)

We can see that outliers are present in many of our variables. But what do we do with them? First of all, it is important to note that you can expect outliers in your research and they can be a result of many different factors, including:

  • human error (e.g. typing the wrong value into a dataset)
  • measurement or instrument error
  • natural variation.

If outliers are a result of the first two reasons, then they should be excluded from the dataset. However, if the outliers are a result of natural variation then they should be included to ensure the integrity of the data. In fact, they may be very important to the research.